Annealed SMC for Breast Cancer Classification

Author: Ravleen Bajaj

Annealed Sequential Monte Carlo for Bayesian Logistic Regression

This analysis applies Annealed Sequential Monte Carlo (SMC) to perform Bayesian inference on the Wisconsin Breast Cancer dataset. We estimate posterior distributions for logistic regression coefficients that predict malignant vs benign tumors.

Setup

import Pkg
Pkg.add(["CSV", "DataFrames", "Printf", "Plots", "StatsBase", "HTTP", "Distributions"])

include("dataset.jl")
include("smc.jl")

using .SMC
using Random
using Plots
using StatsBase
using Statistics
using DataFrames
using Printf

Random.seed!(42)
   Resolving package versions...
  No Changes to `~/projects/smc-breastcancer/Project.toml`
  No Changes to `~/projects/smc-breastcancer/Manifest.toml`
TaskLocalRNG()

Load and Explore Data

# Load the breast cancer dataset
X, y = load_breast_cancer()

println("Dataset dimensions: $(size(X))")
println("Number of features: $(size(X, 2))")
println("Number of samples: $(size(X, 1))")
println("\nClass distribution:")
println("  Benign (0): $(sum(y .== 0))")
println("  Malignant (1): $(sum(y .== 1))")
Downloading dataset from URL...
Dataset shape: (699, 11)

✓ Dataset loaded successfully!
  Samples: 699
  Features: 9
  Benign (0): 458
  Malignant (1): 241
Dataset dimensions: (699, 9)
Number of features: 9
Number of samples: 699

Class distribution:
  Benign (0): 458
  Malignant (1): 241
# Feature statistics
feature_names = ["Clump Thickness", "Uniformity Cell Size", "Uniformity Cell Shape",
                 "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
                 "Bland Chromatin", "Normal Nucleoli", "Mitoses"]

df_summary = DataFrame(
    Feature = feature_names,
    Mean = vec(mean(X, dims=1)),
    Std = vec(std(X, dims=1)),
    Min = vec(minimum(X, dims=1)),
    Max = vec(maximum(X, dims=1))
)

println("\nFeature Summary Statistics:")
show(df_summary, allrows=true)
Feature Summary Statistics:

9×5 DataFrame

 Row │ Feature                      Mean     Std      Min      Max
     │ String                       Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ Clump Thickness              4.41774  2.81574      1.0     10.0
   2 │ Uniformity Cell Size         3.13448  3.05146      1.0     10.0
   3 │ Uniformity Cell Shape        3.20744  2.97191      1.0     10.0
   4 │ Marginal Adhesion            2.80687  2.85538      1.0     10.0
   5 │ Single Epithelial Cell Size  3.21602  2.2143       1.0     10.0
   6 │ Bare Nuclei                  3.48641  3.62193      1.0     10.0
   7 │ Bland Chromatin              3.43777  2.43836      1.0     10.0
   8 │ Normal Nucleoli              2.86695  3.05363      1.0     10.0
   9 │ Mitoses                      1.58941  1.71508      1.0     10.0

Run Annealed SMC

We use Sequential Monte Carlo with annealing to sample from the posterior distribution of logistic regression coefficients. The algorithm gradually transitions from the prior (β=0) to the posterior (β=1).
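The sequence of tempered targets can be written down directly. A minimal sketch, assuming a standard normal prior on the coefficients (the actual densities live in `smc.jl` and may differ):

```julia
# Tempered target for annealed SMC: log π_β(θ) = log p(θ) + β · log p(y | X, θ),
# so β = 0 recovers the prior and β = 1 the full posterior.
# Assumes a N(0, I) prior; the prior in smc.jl may be parameterized differently.

# Numerically stable log(1 + exp(x)) (softplus)
log1pexp(x) = x > 0 ? x + log1p(exp(-x)) : log1p(exp(x))

# N(0, I) log-prior, up to an additive constant
log_prior(θ) = -0.5 * sum(abs2, θ)

# Bernoulli log-likelihood with logit link: Σᵢ [yᵢηᵢ − log(1 + exp(ηᵢ))]
function log_likelihood(θ, X, y)
    η = X * θ
    sum(y .* η .- log1pexp.(η))
end

tempered_logpdf(θ, X, y, β) = log_prior(θ) + β * log_likelihood(θ, X, y)
```

At β = 0 the likelihood term vanishes entirely, which is why the sampler can start from easy prior draws.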

# SMC hyperparameters
N_particles = 500
mcmc_steps = 5
step_scale = 0.2
ess_threshold = 0.5

println("\nRunning Annealed SMC...")
println("  Particles: $N_particles")
println("  MCMC steps per iteration: $mcmc_steps")
println("  Step scale: $step_scale")
println("  ESS threshold: $ess_threshold")

# Run the sampler
@time particles, particle_weights, betas, acc_hist = SMC.annealed_smc(
    X, y;
    N=N_particles,
    mcmc_steps=mcmc_steps,
    step_scale=step_scale,
    ess_frac=ess_threshold
)

println("\nSMC completed!")
println("  Total annealing steps: $(length(betas))")
println("  Final β: $(betas[end])")
println("  Mean acceptance rate: $(round(mean(acc_hist), digits=3))")

Running Annealed SMC...
  Particles: 500
  MCMC steps per iteration: 5
  Step scale: 0.2
  ESS threshold: 0.5
1297.647315 seconds (1.53 G allocations: 1.103 TiB, 5.02% gc time, 0.05% compilation time)

SMC completed!
  Total annealing steps: 29387
  Final β: 1.0
  Mean acceptance rate: 0.854

Annealing Schedule

plot(1:length(betas), betas,
     xlabel="Iteration",
     ylabel="β (Temperature)",
     title="Annealing Schedule",
     legend=false,
     linewidth=2,
     marker=:circle,
     markersize=3,
     grid=true)
hline!([0.0, 1.0], linestyle=:dash, color=:gray, alpha=0.5)

Annealing schedule showing the progression of temperature parameter β from 0 (prior) to 1 (posterior)

The annealing schedule adaptively chooses temperature increments to maintain effective sample size (ESS) above the threshold.
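One common way to realize that adaptation, sketched here under the assumption that `SMC.annealed_smc` does something similar: given the particles' log-likelihoods, bisect on the temperature increment so that the ESS of the incremental weights lands on the target.

```julia
# ESS of the incremental weights w_i ∝ exp(Δβ · ℓ_i), where ℓ_i is
# particle i's log-likelihood. Assumed mechanism, not the code in smc.jl.
function ess_of_increment(loglik, Δβ)
    logw = Δβ .* loglik
    logw .-= maximum(logw)          # stabilize before exponentiating
    w = exp.(logw)
    w ./= sum(w)
    1 / sum(abs2, w)
end

# Bisect for the largest next β whose incremental ESS stays at/above target.
function next_beta(loglik, β, target_ess; iters=50)
    lo, hi = β, 1.0
    ess_of_increment(loglik, hi - β) >= target_ess && return 1.0
    for _ in 1:iters
        mid = (lo + hi) / 2
        ess_of_increment(loglik, mid - β) >= target_ess ? (lo = mid) : (hi = mid)
    end
    lo
end
```

Because the ESS of the incremental weights shrinks as Δβ grows, the bisection always finds the largest safe step, which is what produces the many small increments seen in the schedule above.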

MCMC Acceptance Rate

plot(1:length(acc_hist), acc_hist,
     xlabel="Iteration",
     ylabel="Acceptance Rate",
     title="MCMC Acceptance Rate per Iteration",
     legend=false,
     linewidth=2,
     color=:green,
     grid=true,
     ylim=(0, 1))
hline!([0.234], linestyle=:dash, color=:red, alpha=0.5)  # optimal-scaling heuristic (0.234); legend is off above

Metropolis-Hastings acceptance rates throughout the annealing process

The 0.234 line marks the classic optimal-scaling heuristic for random-walk Metropolis in high dimensions. The observed rates here average about 0.85, well above that target: mixing is still adequate, but the conservative proposal (`step_scale = 0.2`) suggests a larger step size would explore the posterior more efficiently per move.
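The moves being tracked are Metropolis updates at a fixed temperature. A minimal random-walk sketch, assuming that is the kernel shape used in `smc.jl` (`scale` plays the role of `step_scale`):

```julia
# One random-walk Metropolis update. `logtarget` is the tempered log-density
# at the current β; returns the (possibly unchanged) state and whether the
# proposal was accepted. Assumed kernel; the actual move lives in smc.jl.
function rw_metropolis_step(rng, θ, logtarget; scale=0.2)
    proposal = θ .+ scale .* randn(rng, length(θ))
    # Accept with probability min(1, π(proposal)/π(θ)), computed in log space
    if log(rand(rng)) < logtarget(proposal) - logtarget(θ)
        return proposal, true
    else
        return θ, false
    end
end
```

Smaller `scale` raises the acceptance rate but shortens each move, which is the trade-off behind the high acceptance rates observed in this run.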

Posterior Analysis

Posterior Mean Estimates

# Define feature names
feature_names = ["Cl.thickness", "Cell.size", "Cell.shape", 
                 "Marg.adhesion", "Epith.c.size", "Bare.nuclei",
                 "Bl.cromatin", "Normal.nucleoli", "Mitoses"]

# Compute weighted posterior statistics
posterior_mean = vec(sum(particles .* particle_weights, dims=1))
posterior_var = vec(sum((particles .- posterior_mean').^2 .* particle_weights, dims=1))
posterior_std = sqrt.(posterior_var)

df_posterior = DataFrame(
    Feature = feature_names,
    Mean = posterior_mean,
    Std = posterior_std,
    Lower_95 = posterior_mean .- 1.96 .* posterior_std,
    Upper_95 = posterior_mean .+ 1.96 .* posterior_std
)

println("\nPosterior Summary:")
show(df_posterior, allrows=true)
Posterior Summary:

9×5 DataFrame

 Row │ Feature          Mean       Std        Lower_95     Upper_95
     │ String           Float64    Float64    Float64      Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ Cl.thickness     -0.337383  0.0555917  -0.446342    -0.228423
   2 │ Cell.size         0.940135  0.130293    0.68476      1.19551
   3 │ Cell.shape        0.192543  0.100826   -0.00507606   0.390161
   4 │ Marg.adhesion     0.111208  0.0763442  -0.0384265    0.260843
   5 │ Epith.c.size     -0.822913  0.109199   -1.03694     -0.608882
   6 │ Bare.nuclei       0.569012  0.0594479   0.452494     0.68553
   7 │ Bl.cromatin      -0.5293    0.0925939  -0.710784    -0.347815
   8 │ Normal.nucleoli   0.333988  0.0707013   0.195414     0.472563
   9 │ Mitoses          -0.230586  0.0928345  -0.412542    -0.0486306

Interpretation

Positive coefficients increase the log-odds of malignancy:

# Identify "significant" features (95% CI excludes zero)
significant = (df_posterior.Lower_95 .> 0) .| (df_posterior.Upper_95 .< 0)

println("\nFeatures with credible intervals excluding zero:")
for (i, feat) in enumerate(feature_names)
    if significant[i]
        direction = posterior_mean[i] > 0 ? "increases" : "decreases"
        println("  • $feat: $(direction) malignancy risk")
        println("    Coefficient: $(round(posterior_mean[i], digits=3)) ± $(round(posterior_std[i], digits=3))")
    end
end

Features with credible intervals excluding zero:
  • Cl.thickness: decreases malignancy risk
    Coefficient: -0.337 ± 0.056
  • Cell.size: increases malignancy risk
    Coefficient: 0.94 ± 0.13
  • Epith.c.size: decreases malignancy risk
    Coefficient: -0.823 ± 0.109
  • Bare.nuclei: increases malignancy risk
    Coefficient: 0.569 ± 0.059
  • Bl.cromatin: decreases malignancy risk
    Coefficient: -0.529 ± 0.093
  • Normal.nucleoli: increases malignancy risk
    Coefficient: 0.334 ± 0.071
  • Mitoses: decreases malignancy risk
    Coefficient: -0.231 ± 0.093
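Since the linear predictor lives on the log-odds scale, exponentiating a coefficient gives the multiplicative change in the odds of malignancy per one-unit increase in that feature. A quick illustration using the Cell.size posterior mean from the table above:

```julia
# exp(θ_j) is the odds ratio for a one-unit increase in feature j.
# 0.940 is the Cell.size posterior mean reported in the posterior summary.
coef_cell_size = 0.940
odds_ratio = exp(coef_cell_size)
println("Odds ratio per unit of Cell.size: ", round(odds_ratio, digits=2))
```

The same transformation applies to the credible-interval endpoints, since exp is monotone.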

Posterior Distributions

Coefficient Posterior Densities

# Create subplots for each coefficient
plots_array = []

for (i, feat) in enumerate(feature_names)
    # Get samples for this feature
    samples = particles[:, i]
    
    # Create weighted histogram
    p = histogram(samples, weights=particle_weights,
                 bins=30,
                 normalize=:pdf,
                 xlabel="Coefficient Value",
                 ylabel="Density",
                 title=feat,
                 legend=false,
                 alpha=0.7,
                 color=:steelblue)
    
    # Add posterior mean
    vline!([posterior_mean[i]], 
           linewidth=2, 
           color=:red, 
           linestyle=:solid)
    
    # Add credible interval
    vline!([df_posterior.Lower_95[i], df_posterior.Upper_95[i]], 
           linewidth=1.5, 
           color=:orange, 
           linestyle=:dash)
    
    # Add zero line
    vline!([0], linewidth=1, color=:gray, linestyle=:dot)
    
    push!(plots_array, p)
end

plot(plots_array..., layout=(5, 2), size=(1000, 1200))

Posterior distributions for each logistic regression coefficient

Top Features by Effect Size

# Rank features by absolute posterior mean
effect_sizes = abs.(posterior_mean)
ranked_idx = sortperm(effect_sizes, rev=true)

println("\nTop 5 Features by Effect Size:")
for (rank, idx) in enumerate(ranked_idx[1:5])
    println("$rank. $(feature_names[idx])")
    println("   Coefficient: $(round(posterior_mean[idx], digits=3))")
    println("   Std Dev: $(round(posterior_std[idx], digits=3))")
    println("   95% CI: [$(round(df_posterior.Lower_95[idx], digits=3)), $(round(df_posterior.Upper_95[idx], digits=3))]")
    println()
end

Top 5 Features by Effect Size:
1. Cell.size
   Coefficient: 0.94
   Std Dev: 0.13
   95% CI: [0.685, 1.196]

2. Epith.c.size
   Coefficient: -0.823
   Std Dev: 0.109
   95% CI: [-1.037, -0.609]

3. Bare.nuclei
   Coefficient: 0.569
   Std Dev: 0.059
   95% CI: [0.452, 0.686]

4. Bl.cromatin
   Coefficient: -0.529
   Std Dev: 0.093
   95% CI: [-0.711, -0.348]

5. Cl.thickness
   Coefficient: -0.337
   Std Dev: 0.056
   95% CI: [-0.446, -0.228]

Particle Cloud Visualization

# Plot top 2 features
idx1, idx2 = ranked_idx[1:2]

scatter(particles[:, idx1], particles[:, idx2],
        markersize=particle_weights .* length(particle_weights) .* 3,
        alpha=0.5,
        xlabel=feature_names[idx1],
        ylabel=feature_names[idx2],
        title="Joint Posterior: Top 2 Features",
        legend=false,
        color=:steelblue)

# Add posterior means
scatter!([posterior_mean[idx1]], [posterior_mean[idx2]],
         markersize=10,
         color=:red,
         marker=:star,
         label="Posterior Mean")

Particle cloud showing joint posterior for top two features

Model Diagnostics

Effective Sample Size History

# Report resampling-related diagnostics from the run. Note: length(betas) - 1
# counts annealing increments, which equals the number of resampling events
# only if the sampler resamples at every temperature step.
println("\nEffective Sample Size Diagnostics:")
println("  Number of particles: $N_particles")
println("  ESS threshold: $(ess_threshold * N_particles)")
println("  Resampling triggered: $(length(betas) - 1) times")

Effective Sample Size Diagnostics:
  Number of particles: 500
  ESS threshold: 250.0
  Resampling triggered: 29386 times
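Resampling itself can be implemented several ways; a common low-variance choice is systematic resampling, sketched below (assumed scheme; the resampler inside `SMC.annealed_smc` may differ).

```julia
# Systematic resampling: one uniform draw stratified across N equal slots.
# Takes normalized weights w and returns N particle indices to keep.
function systematic_resample(rng, w)
    N = length(w)
    positions = (rand(rng) .+ (0:N-1)) ./ N   # stratified points in [0, 1)
    c = cumsum(w)                             # cumulative weight boundaries
    idx = Vector{Int}(undef, N)
    i = 1
    for (j, u) in enumerate(positions)
        while u > c[i]                        # advance to the particle owning u
            i += 1
        end
        idx[j] = i
    end
    idx
end
```

After resampling, all weights are reset to 1/N, so heavily weighted particles are duplicated and negligible ones are dropped.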

Particle Weight Distribution

histogram(particle_weights,
          bins=30,
          xlabel="Weight",
          ylabel="Frequency",
          title="Final Particle Weight Distribution",
          legend=false,
          color=:purple,
          alpha=0.7)

# Add ESS calculation
final_ess = 1 / sum(particle_weights.^2)
println("\nFinal Effective Sample Size: $(round(final_ess, digits=1))")
println("ESS / N: $(round(final_ess / N_particles, digits=3))")

Final Effective Sample Size: 250.0
ESS / N: 0.5

Predictive Performance

# Make predictions using the posterior mean coefficients. The model was fit
# without an intercept (one coefficient per feature), so X is used directly.
eta = X * posterior_mean
probs = 1 ./ (1 .+ exp.(-eta))
y_pred = Float64.(probs .> 0.5)

# Compute accuracy
accuracy = mean(y_pred .== y)
println("\nPredictive Performance (Posterior Mean):")
println("  Accuracy: $(round(accuracy * 100, digits=2))%")

# Confusion matrix
tp = sum((y_pred .== 1) .& (y .== 1))
tn = sum((y_pred .== 0) .& (y .== 0))
fp = sum((y_pred .== 1) .& (y .== 0))
fn = sum((y_pred .== 0) .& (y .== 1))

println("\nConfusion Matrix:")
println("              Predicted")
println("              Neg    Pos")
println("Actual Neg    $tn     $fp")
println("       Pos    $fn     $tp")

# Metrics
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

println("\nMetrics:")
println("  Sensitivity (Recall): $(round(sensitivity, digits=3))")
println("  Specificity: $(round(specificity, digits=3))")
println("  Precision: $(round(precision, digits=3))")

Predictive Performance (Posterior Mean):
  Accuracy: 86.55%

Confusion Matrix:
              Predicted
              Neg    Pos
Actual Neg    402     56
       Pos    38     203

Metrics:
  Sensitivity (Recall): 0.842
  Specificity: 0.878
  Precision: 0.784

Summary

This analysis demonstrated Bayesian logistic regression using Annealed Sequential Monte Carlo on the Wisconsin Breast Cancer dataset. Key findings:

  1. Annealing Schedule: The adaptive temperature schedule required 29,387 iterations to transition from prior to posterior.

  2. Key Predictive Features: The features with strongest effects are shown above, with credible intervals excluding zero indicating statistical significance in the Bayesian sense.

  3. Model Performance: The posterior mean classifier achieves 86.6% accuracy on the training data.

  4. Posterior Uncertainty: The posterior distributions show varying degrees of uncertainty across features, with some coefficients more precisely estimated than others.

Next Steps

  • Cross-validation: Implement k-fold CV to assess out-of-sample performance
  • Model comparison: Compare with simpler models using marginal likelihood
  • Feature selection: Use posterior distributions to identify sparse models
  • Sensitivity analysis: Test robustness to prior specifications
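The first next step could start from a fold splitter like the one below, with a hypothetical `kfold_indices` helper (not part of `smc.jl`); each fold would hold out a validation set while the remaining rows feed one SMC run:

```julia
using Random

# Split n row indices into k disjoint, near-equal folds (hypothetical helper).
# Shuffling first avoids any ordering structure leaking into the folds.
function kfold_indices(n, k; rng=Random.default_rng())
    perm = randperm(rng, n)
    [perm[i:k:n] for i in 1:k]
end
```

For the 699-sample dataset above, `kfold_indices(699, 5)` yields five folds of 139-140 rows each.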